{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "<h2>BM25 RAG</h2>\n",
    "</div>\n",
    "\n",
    "<div align=\"center\">\n",
    "    <h3><a href=\"https://aiengineering.academy/\" target=\"_blank\">AI Engineering.academy</a></h3>\n",
    "</div>\n",
    "\n",
    "<div align=\"center\">\n",
    "<a href=\"https://aiengineering.academy/\" target=\"_blank\">\n",
    "<img src=\"https://raw.githubusercontent.com/adithya-s-k/AI-Engineering.academy/main/assets/banner.png\" alt=\"AI Engineering Academy\" width=\"50%\">\n",
    "</a>\n",
    "</div>\n",
    "\n",
    "<div align=\"center\">\n",
    "\n",
    "[![GitHub Stars](https://img.shields.io/github/stars/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/stargazers)\n",
    "[![GitHub Forks](https://img.shields.io/github/forks/adithya-s-k/AI-Engineering.academy?style=social)](https://github.com/adithya-s-k/AI-Engineering.academy/network/members)\n",
    "[![GitHub Issues](https://img.shields.io/github/issues/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/issues)\n",
    "[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/pulls)\n",
    "[![License](https://img.shields.io/github/license/adithya-s-k/AI-Engineering.academy)](https://github.com/adithya-s-k/AI-Engineering.academy/blob/main/LICENSE)\n",
    "\n",
    "</div>\n",
    "\n",
    "## Introduction\n",
    "BM25 Retrieval-Augmented Generation (BM25 RAG) is an advanced technique that combines the power of the BM25 (Best Matching 25) algorithm for information retrieval with large language models for text generation. This approach enhances the accuracy and relevance of generated responses by grounding them in specific, retrieved information using a proven probabilistic retrieval model.\n",
    "\n",
    "This notebook aims to provide a clear and concise introduction to BM25 RAG, suitable for both beginners and experienced practitioners who want to understand and implement this technology.\n",
    "\n",
    "### Motivation\n",
    "\n",
    "Traditional RAG systems often use dense vector embeddings for retrieval, which can be computationally expensive and may not always capture the nuances of term importance. BM25 RAG addresses these limitations by using a probabilistic retrieval model that considers term frequency, inverse document frequency, and document length. This approach can lead to more accurate and interpretable retrieval, especially for queries requiring specific or rare information.\n",
    "\n",
    "### Method Details\n",
    "\n",
    "#### Document Preprocessing and Indexing\n",
    "\n",
    "1. **Document Chunking**: The knowledge base documents are preprocessed and split into manageable chunks to create a searchable corpus.\n",
    "   \n",
    "2. **Tokenization and Indexing**: Each chunk is tokenized, and an inverted index is created. The BM25 algorithm calculates term frequencies and inverse document frequencies.\n",
    "\n",
    "#### BM25 Retrieval-Augmented Generation Workflow\n",
    "\n",
    "1. **Query Input**: A user provides a query that needs to be answered.\n",
    "   \n",
    "2. **Retrieval Step**: The query is tokenized, and relevant documents are retrieved using the BM25 scoring algorithm. This step considers term frequency, inverse document frequency, and document length to find the most relevant chunks.\n",
    "\n",
    "3. **Generation Step**: The retrieved document chunks are passed to a large language model as additional context. The model uses this context to generate a more accurate and relevant response.\n",
    "\n",
    "### Key Features of BM25 RAG\n",
    "\n",
    "1. **Probabilistic Retrieval**: BM25 uses a probabilistic model to rank documents, providing a theoretically sound basis for retrieval.\n",
    "   \n",
    "2. **Term Frequency Saturation**: BM25 accounts for diminishing returns from repeated terms, improving retrieval quality.\n",
    "\n",
    "3. **Document Length Normalization**: The algorithm considers document length, reducing bias towards longer documents.\n",
    "\n",
    "4. **No Embedding Required**: Unlike vector-based approaches, BM25 doesn't require document embeddings, which can be computationally efficient.\n",
    "\n",
    "### Benefits of this Approach\n",
    "\n",
    "1. **Improved Accuracy**: Combines the strengths of probabilistic retrieval and neural text generation.\n",
    "\n",
    "2. **Interpretability**: BM25 scoring provides a more interpretable retrieval process compared to dense vector retrieval methods.\n",
    "\n",
    "3. **Effective for Long-tail Queries**: Particularly good at handling queries requiring specific or rare information.\n",
    "\n",
    "### Conclusion\n",
    "\n",
    "BM25 Retrieval-Augmented Generation represents a powerful fusion of classic information retrieval techniques and modern language models. By leveraging the strengths of the BM25 algorithm, this approach offers improved accuracy, interpretability, and efficiency in various natural language processing tasks. As AI continues to evolve, BM25 RAG stands out as a robust method for building more reliable and context-sensitive AI systems, especially in domains where precise information retrieval is crucial.\n",
    "\n",
    "### Prerequisites\n",
    "- Preferably Python 3.11\n",
    "- Jupyter Notebook or JupyterLab\n",
    "- LLM API Key\n",
    "    - You can use any LLM of your choice; in this notebook, we use OpenAI's GPT models\n",
    "\n",
    "With these steps, you can implement a BM25 RAG system to enhance the capabilities of language models by incorporating efficient, probabilistic information retrieval, improving their effectiveness in various applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install llama-index\n",
    "# !pip install llama-index-retrievers-bm25\n",
    "# !pip install llama-index-vector-stores-qdrant \n",
    "# !pip install llama-index-readers-file \n",
    "# !pip install llama-index-embeddings-fastembed \n",
    "# !pip install llama-index-llms-openai\n",
    "# !pip install llama-index-llms-groq\n",
    "# !pip install -U qdrant_client fastembed\n",
    "# !pip install python-dotenv\n",
    "# !pip install matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/adithya/miniconda3/envs/basic-rag/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n"
     ]
    }
   ],
   "source": [
    "# Standard library imports\n",
    "import logging\n",
    "import sys\n",
    "import os\n",
    "\n",
    "# Third-party imports\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "\n",
    "# Qdrant client import\n",
    "import qdrant_client\n",
    "\n",
    "# LlamaIndex core imports\n",
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.core import Settings\n",
    "\n",
    "# LlamaIndex vector store import\n",
    "from llama_index.vector_stores.qdrant import QdrantVectorStore\n",
    "\n",
    "# Embedding model imports\n",
    "from llama_index.embeddings.fastembed import FastEmbedEmbedding\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "# LLM import\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.llms.groq import Groq\n",
    "# Load environment variables\n",
    "load_dotenv()\n",
    "\n",
    "# Get OpenAI API key from environment variables\n",
    "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
    "# GROK_API_KEY = os.getenv(\"GROQ_API_KEY\")\n",
    "\n",
    "# Setting up Base LLM\n",
    "Settings.llm = OpenAI(\n",
    "    model=\"gpt-4o-mini\", temperature=0.1, max_tokens=8096, streaming=True\n",
    ")\n",
    "\n",
    "# Settings.llm = Groq(model=\"llama3-70b-8192\" , api_key=GROK_API_KEY)\n",
    "\n",
    "# Set the embedding model\n",
    "# Option 1: Use FastEmbed with BAAI/bge-base-en-v1.5 model (default)\n",
    "# Settings.embed_model = FastEmbedEmbedding(model_name=\"BAAI/bge-base-en-v1.5\")\n",
    "\n",
    "# Option 2: Use OpenAI's embedding model (commented out)\n",
    "# If you want to use OpenAI's embedding model, uncomment the following line:\n",
    "Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)\n",
    "\n",
    "# Qdrant configuration (commented out)\n",
    "# If you're using Qdrant, uncomment and set these variables:\n",
    "# QDRANT_CLOUD_ENDPOINT = os.getenv(\"QDRANT_CLOUD_ENDPOINT\")\n",
    "# QDRANT_API_KEY = os.getenv(\"QDRANT_API_KEY\")\n",
    "\n",
    "# Note: Remember to add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if using Qdrant Hosted version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading files: 100%|██████████| 1/1 [00:00<00:00,  6.11file/s]\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "# load documents\n",
    "documents = SimpleDirectoryReader(\"../data\", recursive=True).load_data(show_progress=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parsing nodes: 100%|██████████| 58/58 [00:00<00:00, 14822.67it/s]\n",
      "Generating embeddings: 100%|██████████| 58/58 [00:01<00:00, 31.58it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of Nodes: 58\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.node_parser import TokenTextSplitter\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.core.node_parser import MarkdownNodeParser\n",
    "from llama_index.core.node_parser import SemanticSplitterNodeParser\n",
    "from llama_index.core.ingestion import IngestionPipeline\n",
    "\n",
    "pipeline = IngestionPipeline(\n",
    "    transformations=[\n",
    "        MarkdownNodeParser(include_metadata=True),\n",
    "        # TokenTextSplitter(chunk_size=500, chunk_overlap=20),\n",
    "        # SentenceSplitter(chunk_size=1024, chunk_overlap=20),\n",
    "        # SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),\n",
    "        Settings.embed_model,\n",
    "    ],\n",
    ")\n",
    "\n",
    "# Ingest directly into a vector db\n",
    "nodes = pipeline.run(documents=documents , show_progress=True)\n",
    "print(\"Number of Nodes:\",len(nodes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize a docstore to store nodes\n",
    "# also available are mongodb, redis, postgres, etc for docstores\n",
    "import asyncio\n",
    "from llama_index.core.storage.docstore import SimpleDocumentStore\n",
    "\n",
    "docstore = SimpleDocumentStore()\n",
    "docstore.add_documents(nodes)\n",
    "docstore.persist(persist_path=\"./docstore.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Mongo DB as Document Store\n",
    "\n",
    "# !pip install llama-index-storage-index-store-mongodb\n",
    "# !pip install llama-index-storage-docstore-mongodb\n",
    "\n",
    "# from llama_index.storage.docstore.mongodb import MongoDocumentStore\n",
    "# from llama_index.storage.kvstore.mongodb import MongoDBKVStore\n",
    "# from pymongo import MongoClient\n",
    "# from motor.motor_asyncio import AsyncIOMotorClient\n",
    "\n",
    "# MONGO_URI = os.getenv(\"MONGO_URI\")\n",
    "# kv_store = MongoDBKVStore(mongo_client=MongoClient(MONGO_URI) , mongo_aclient=AsyncIOMotorClient(MONGO_URI))\n",
    "# docstore = MongoDocumentStore(namespace=\"BM25_RAG\" ,mongo_kvstore=kv_store).from_uri(uri=MONGO_URI)\n",
    " \n",
    "# docstore.add_documents(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # !pip install llama-index-storage-docstore-redis\n",
    "# # !pip install llama-index-storage-index-store-redis\n",
    "# from llama_index.storage.docstore.redis import RedisDocumentStore\n",
    "\n",
    "# REDIS_HOST = os.getenv(\"REDIS_HOST\", \"localhost\")\n",
    "# REDIS_PORT = os.getenv(\"REDIS_PORT\", 6379)\n",
    "\n",
    "# docstore=RedisDocumentStore.from_host_and_port(\n",
    "#         host=REDIS_HOST, port=REDIS_PORT, namespace=\"BM25_RAG\"\n",
    "#     )\n",
    "# docstore.add_documents(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.retrievers.bm25 import BM25Retriever\n",
    "import Stemmer\n",
    "\n",
    "# We can pass in the index, docstore, or list of nodes to create the retriever\n",
    "bm25_retriever = BM25Retriever.from_defaults(\n",
    "    docstore=docstore,\n",
    "    similarity_top_k=4,\n",
    "    # Optional: We can pass in the stemmer and set the language for stopwords\n",
    "    # This is important for removing stopwords and stemming the query + text\n",
    "    # The default is english for both\n",
    "    stemmer=Stemmer.Stemmer(\"english\"),\n",
    "    language=\"english\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 04328457-baaf-4ee5-be1b-70f604c2fe05<br>**Similarity:** 1.7577731609344482<br>**Text:** Authors\n",
       "\n",
       "Ashish Vaswani*\n",
       "\n",
       "Noam Shazeer*\n",
       "\n",
       "Niki Parmar*\n",
       "\n",
       "Jakob Uszkoreit*\n",
       "\n",
       "Google Brain\n",
       "\n",
       "avaswani@google.com\n",
       "\n",
       "noam@google.com\n",
       "\n",
       "nikip@google.com\n",
       "\n",
       "usz@google.com\n",
       "\n",
       "Llion Jones*\n",
       "\n",
       "Aidan N. Gomez* †\n",
       "\n",
       "Łukasz Kaiser*\n",
       "\n",
       "Google Research\n",
       "\n",
       "University of Toronto\n",
       "\n",
       "llion@google.com\n",
       "\n",
       "aidan@cs.toronto.edu\n",
       "\n",
       "lukaszkaiser@google.com\n",
       "\n",
       "Illia Polosukhin* ‡\n",
       "\n",
       "illia.polosukhin@gmail.com<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** a2d32369-caf0-4529-ab20-dfd7b49a7705<br>**Similarity:** 1.4452868700027466<br>**Text:** 5.2 Hardware and Schedule\n",
       "\n",
       "We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, (described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "# will retrieve context from specific companies\n",
    "retrieved_nodes = bm25_retriever.retrieve(\n",
    "    \"Who are the Authors of this paper\"\n",
    ")\n",
    "for node in retrieved_nodes:\n",
    "    display_source_node(node, source_length=5000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** e3519af3-3040-4fb5-84d3-485887327d61<br>**Similarity:** 2.1414361000061035<br>**Text:** What we are missing\n",
       "\n",
       "In my opinion...\n",
       "\n",
       "Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.\n",
       "\n",
       "15<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 0fd6220c-d85e-4ede-8f33-c1237de9709b<br>**Similarity:** 1.9467532634735107<br>**Text:** Input-Input Layer 5\n",
       "\n",
       "The law will never be perfect, but its application should be just.\n",
       "\n",
       "This is what we are missing, in my opinion.<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 403d39f5-3650-4caa-bce7-cda0bbd11496<br>**Similarity:** 1.4834907054901123<br>**Text:** Attention Visualizations\n",
       "\n",
       "It is this spirit that a majority of American governments have passed new laws since 2009.\n",
       "\n",
       "Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.\n",
       "\n",
       "Voting process more difficult.\n",
       "---<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** 65456aec-30fc-4363-84f1-d1caa5fcaa28<br>**Similarity:** 1.2847681045532227<br>**Text:** 2 Background\n",
       "\n",
       "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.\n",
       "\n",
       "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].\n",
       "\n",
       "End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].\n",
       "\n",
       "To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "retrieved_nodes = bm25_retriever.retrieve(\"What is Attention mechanism\")\n",
    "for node in retrieved_nodes:\n",
    "    display_source_node(node, source_length=5000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_engine import RetrieverQueryEngine\n",
    "from llama_index.core import get_response_synthesizer\n",
    "from llama_index.core.response_synthesizers import ResponseMode\n",
    "\n",
    "response_synthesizer = get_response_synthesizer(\n",
    "    response_mode=ResponseMode.COMPACT_ACCUMULATE\n",
    ")\n",
    "\n",
    "BM25_QUERY_ENGINE = RetrieverQueryEngine(\n",
    "    retriever=bm25_retriever,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = BM25_QUERY_ENGINE.query(\"How many encoders are stacked in the transformer?\")\n",
    "display(Markdown(str(response)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "build-venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
